Celestin Apprentice 4

home *** CD-ROM | disk | FTP | other *** search

/ Celestin Apprentice 4 / Apprentice-Release4.iso / Source Code / Add-Ons / MPW / MPW noweb 2.7 / examples / scanner.nw (.txt) < prev next >

Wrap

LaTeX Document | 1995-05-30 | 14.9 KB | 368 lines | [TEXT/MPS ]

% Norman Ramsey (nr@notecnirp) --- CS 320 \def\sizedboxit#1#2{\vtop{\vbox{\hrule\hbox{\vrule\kern #2% \vtop{\vbox{\kern #2\hbox{#1}}\kern #2}\kern #2\vrule}}\hrule}} \def\boxit#1{\sizedboxit{#1}{1pt}} \def\token#1{\boxit{\strut\tt #1}} \setcounter{secnumdepth}{0} % l2h ignore boxit % l2h argblock token tt In this assignment I print intermediate code. <<grammatical C functions>>= void print6 (Exp e) { extern int yylineno; printf("line %d:\n",yylineno); if (e.type==error_type) printf("# ERROR\n"); else printTree(e.tree,stdout); <<grammatical C declarations>>= void print6(Exp e); \section{The {\tt yacc} value stack} Here are all the objects I use as synthesized attributes: <<grammatical declarations>>= %union { char *string; Type type; Product product; Symbol symbol; Exp exp; Explist explist; int bit; \section{3. Vocabulary and representation} \subsection{Representation of terminal symbols} {\tt yacc} and {\tt lex} use integers to represent terminal symbols (tokens). Single-character tokens are represented by their ASCII codes. Longer tokens are declared using {\tt yacc}'s [[%token]] declaration; {\tt yacc} chooses an integer representation of each such token. These representations are made available to the lexer via [[y.tab.h]]. I use the standard trick from Kernighan and Pike ([[x.tab.h]] instead of [[y.tab.h]]) to avoid remaking the lexer unecessarily. <<lexical include files>>= #include "x.tab.h" \subsection{1. Identifiers} Since {\tt lex} is notoriously slow at using patterns to recognize reserved words, I look up every identifier in a table of reserved words. [[idcategory(id)]] returns the category of the reserved word [[id]] when [[id]] is in fact a reserved word. When [[id]] is not a reserved word, [[idcategory(id)]] returns [[IDENT]]. <<lexical definitions>>= letter [A-Za-z] digit [0-9] The offensive [[<INITIAL>]] below has to do with handling comments (q.v.). Both identifiers and reserved words are saved in the string table and put on the {\tt yacc} value stack. <<lexical rules>>= <INITIAL>{letter}({letter}|{digit})* { yylval.string = strsave(yytext); return idcategory(yytext); } <<lexical C declarations>>= extern char *strsave(char *s); <<grammatical declarations>>= %token IDENT %type <string> IDENT <<grammatical rules>>= ident : IDENT ; <<grammatical declarations>>= %type <string> ident <<grammatical C declarations>>= extern char yytext[]; \subsection{2. Numbers} I have to use two different [[ScaleFactor]] definitions so I can tell the difference between long and short reals. Notice that the scale factor is {\em not} optional for the long real. <<lexical rules>>= <INITIAL>{digit}+|{digit}({hexDigit}*)[Hh] { yylval.string = strsave(yytext); return INTEGER; } <INITIAL>{digit}+"."{digit}*{EScaleFactor}? { yylval.string = strsave(yytext); return REAL; } <INITIAL>{digit}+"."{digit}*{DScaleFactor} { yylval.string = strsave(yytext); return LONGREAL; } <<grammatical declarations>>= %token <string> INTEGER REAL LONGREAL I permit lower case where Wirth insists on upper case. This will be convenient later on. Besides, Hanson does it. I need parentheses around things like [[EScaleFactor]] because I'n not really defining a regular expression---I'm using a macro facility. <<lexical definitions>>= hexDigit [0-9A-Fa-f] EScaleFactor ([eE][+\-]?{digit}+) DScaleFactor ([dD][+\-]?{digit}+) \subsection{3,4. Strings and character constants} Single character strings [["x"]] become character constants, not strings, thanks to the {\tt lex} disambiguation rules. <<lexical rules>>= <INITIAL>"\'"{nonquote}"\'" { yylval.string=strsave(yytext); return CHARCONST; } <INITIAL>{digit}{hexDigit}*[Xx] { yylval.string=strsave(yytext); return CHARCONST; } <INITIAL>"\""{nondoublequote}*"\"" { yylval.string=strsave(yytext); return STRING; } <<grammatical declarations>>= %token <string> CHARCONST STRING The character set for string literals is awkward, because I want to include backslash escapes. I use the ANSI standard backslash escapes from section A2.5.2 of the second edition of Kernighan and Ritchie. Because the lexical analyzer is probably not the right place to handle illegal backslash escapes, I allow any reasonable character to follow the backslash. I define [[nonoctalx]] to be those characters that can't start an octal or hexadecimal constant (when following a backslash). Then I can recognize octal and hexdecimal character constants like [["\012"]]. I {\em don't} insist that at least one [[hexDigit]] follow [[\x]], because again that should be handled downstream of the lexical analyzer. <<lexical definitions>>= plainnonquote [ \t\]\"!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~] plainnondoublequote [ \t\]\'!#$%&()*+,\-./0-9:;<=>?@A-Z[^_`a-z{|}~] escapedchar (\\({nonoctalx}|{octal}{octal}?{octal}?|x{hexDigit}*)) nonoctalx [ \]\'\"!#$%&()*+,\-./89:;<=>?@A-Z[\\^_`a-wyz{|}~] nonquote ({plainnonquote}|{escapedchar}) nondoublequote ({plainnondoublequote}|{escapedchar}) octal [0-7] I also need to handle strings that don't terminate. When I see one, I gobble up the whole line on which it sits---that should make it easier for the parser to recover. (The alternative is trying to tokenize the characters following the open quote.) <<lexical rules>>= <INITIAL>("\""{nondoublequote}*|"\'"{nonquote}*) <<complain; return bad string>> I print the first few characters of a nonterminated string, followed by an ellipsis. I return the string anyway because that way there's a chance that the parser can just ignore the error. <<complain; return bad string>>= { yyerror("Unterminated string %.8s%s",yytext,yyleng>8?"...":""); yylval.string=strsave(""); return STRING; I include prototypes for the string functions, to keep {\tt lcc -A} from complaining about missing prototypes. <<common include files>>= #include <string.h> \subsection{5. Operators and delimiters} Here are [[%token]] declarations for all the multicharacter tokens, including the reserved words. I use [[yyBEGIN]] because [[BEGIN]] means something special to {\tt lex}. <<grammatical declarations>>= %token ARROW INC DEC LSHIFT RSHIFT LEQ GEQ EQ NEQ AND OR /* -> ++ -- @<< @>> <= >= == != && || */ I make sure the lexer always returns strings for identifiers and reserved words. Recognizing the operators and delimiters is straightforward: <<lexical rules>>= <INITIAL>"->" return ARROW; <INITIAL>"++" return INC; <INITIAL>"--" return DEC; <INITIAL>"<<" return LSHIFT; <INITIAL>">>" return RSHIFT; <INITIAL>"<=" return LEQ; <INITIAL>">=" return GEQ; <INITIAL>"==" return EQ; <INITIAL>"!=" return NEQ; <INITIAL>"&&" return AND; <INITIAL>"||" return OR; <INITIAL>[\]+!\-*/~&.,;|()[{}^=#<>:] return *yytext; \paragraph{Reserved word search} Recall that, instead of having the {\tt lex}-generated automaton recognize reserved words, I wanted to look up each identifier to see if it is a reserved word. I put the reserved words into an array and use binary search to find their categories. A word that isn't in the list has category [[IDENT]]. The list itself is excruciating to read. I use a trick I got from Dave Hanson---I put the list in a header file as calls to the [[kw]] (keyword) macro. Then I include the header with appropriate macros wherever I need it. <<list of reserved words>>= kw(INT, "int") A binary search table of reserved words: <<reserved word data structures>>= static struct reserved {char *s; int category;} reservedwords[] = { #define kw(VAL,STRING) {STRING,VAL}, <<list of reserved words>> #undef kw static int numreservedwords = (sizeof(reservedwords)/sizeof(struct reserved)); [[idcategory]] is just binary search. <<lexical C functions>>= <<reserved word data structures>> static int idcategory (char *id) { int low=0, high=numreservedwords-1, mid; int compare; while (low <= high) { /* Invariant: if id is in the initial range low...high, it is in the current range low...high */ mid = (low+high)/2; /* note low <= mid <= high */ compare = strcmp(id,reservedwords[mid].s); if (compare>0) low = mid + 1; else if (compare<0) high = mid - 1; else return reservedwords[mid].category; } return IDENT; /* id is not a reserved word */ <<lexical C declarations>>= static int idcategory(char *); \subsection{Comments} I use the standard trick of defining a special state just for comments. A begin comment sends the lexer into state [[<COMMENT>]], and an end comment returns it to state [[<INITIAL>]]. All tokens that aren't comments are recognized only in state [[<INITIAL>]], which explains the offensive [[<INITIAL>]] prepended to every rule. <<lexical definitions>>= %S COMMENT <<lexical rules>>= <INITIAL>"/*" BEGIN COMMENT; <COMMENT>"*/" BEGIN INITIAL; <COMMENT>"\n" ; <COMMENT>. ; \subsection{Whitespace and bad characters} <<lexical rules>>= <INITIAL>[ \t\n]+ ; /* ignore whitespace */ <INITIAL>. <<complain about a bad character>> The error message we print is different for printable and nonprintable characters. <<complain about a bad character>>= { if (*yytext >= ' ' && *yytext < 0177) yyerror("bad character `%c'", *yytext); else yyerror("bad character `\\%03o'", *yytext); \section{8. Expressions} \subsection{8.1 Operands (designators and constants)} There are no qualified identifiers, so this simplifies the parsing of designators. It is a bit awkward to distinguish variables and parameters from constant identifiers. There is also an awkwardness with rvalues---boolean expressions have to be converted from ``test'' to values using [[BOOL]]. Following a suggestion of Hanson's, I use three nonterminals: [[exp]] is an expression (possibly a test); [[rvalue]] is an rvalue (never a test), and [[lvalue]] is an lvalue. I introduce [[complex_lvalue]] because I need to distinguish identifiers from all other lvalues (otherwise I get a reduce-reduce conflict when converting lvalues to expressions). <<grammatical declarations>>= %type <exp> arraydes lvalue complex_lvalue rvalue exp Here are productions for all the C literals. I use [[make_constval(type,string)]] to convert a string to a value of the type desired. I issue warnings for long reals, since they aren't supported in Oberon/320. <<grammatical rules>>= exp : INTEGER { $$ = make_constval(integer_type,$1); } | REAL { $$ = make_constval(real_type,$1); } | LONGREAL { warning("Long reals not supported (replaced with zero)"); $$ = make_constval(real_type,strsave("0.0")); } | CHARCONST { $$ = make_constval(char_type,$1); } | STRING { $$ = make_constval(build_array(0,stringchar_type),$1); } \subsection{8.2 Operators} I use {\tt yacc} precedence declarations. These declarations define precedence. My task is much simplified because unary and binary [[-]] have exactly the same precedence. <<grammatical declarations>>= %left OR %left AND %left '|' %left '^' %left '&' %left EQ NEQ %left '<' '>' LEQ GEQ %left LSHIFT RSHIFT %left '+' '-' %left '*' '/' '%' %right '!' '~' INC DEC /* unary operators */ %left ARROW '.' [[binop]] checks types and generates intermediate code. Consult the chapter on typechecking for the description of the correct operation of [[binop]] and the meanings of various permissions [[p_xxx]]. <<grammatical rules>>= complex_lvalue : lvalue '.' ident { $$ = find_field($1,$3); } | exp ARROW ident { $$ = find_field(deref($1),$3) } | '*' rvalue { $$ = deref($1); } | arraydes ']' { $$ = $1; } arraydes : rvalue '[' exp { $$ = subscript($1,$3); } lvalue : complex_lvalue { $$ = $1; } | ident { $$ = lookup_lvalue($1); } exp : complex_lvalue { $$.type=$1.type; $$.tree=tMEM($1.type->size,$1.tree); } | ident { $$ = lookup_exp($1); } rvalue : exp { $$ = boolval($1); } <<grammatical rules>>= exp : exp EQ exp { $$ = binop(OSeq, $1,$3,boolean_type,p_equality); } | exp NEQ exp { $$ = binop(OSneq,$1,$3,boolean_type,p_equality); } | exp '<' exp { $$ = binop(OSlt, $1,$3,boolean_type,p_relational); } | exp LEQ exp { $$ = binop(OSleq,$1,$3,boolean_type,p_relational); } | exp '>' exp { $$ = binop(OSgt, $1,$3,boolean_type,p_relational); } | exp GEQ exp { $$ = binop(OSgeq,$1,$3,boolean_type,p_relational); } exp : exp '+' exp { $$ = binop(OSplus, $1,$3,0,p_numeric); } | exp '-' exp { $$ = binop(OSminus,$1,$3,0,p_numeric); } | exp OR exp { $$ = binop(OSor, $1,$3,0,p_boolean); } exp : exp '*' exp { $$ = binop(OSmul, $1,$3,0,p_numeric); } | exp AND exp { $$ = binop(OSand,$1,$3,0,p_boolean); } exp : '(' exp ')' { $$ = $2; } \subsection{Function calls} Calls to functions (procedures having a return type) may occur {\em only} as factors in the production given below. bottom of p.~678 of the Oberon report makes it clear that the~[[()]] are required even if the function has no parameters. <<not yet grammatical rules>>= exp : ident ActualParameters { $$ = fcall($1,$2); } ActualParameters: '(' ExpList ')' { $$ = $2; } | '(' ')' { $$ = 0; } ExpList : rvalue ',' ExpList { $$ = explist($1,$3); } | rvalue { $$ = explist($1,0); } <<not yet grammatical declarations>>= %type <explist> ActualParameters ExpList \section{11. Compilation units} <<grammatical rules>>= module : exp { compile($1); } <<grammatical declarations>>= %start module \section{Error recovery} Here are some simple error productions that might help the parser continue. The first four gobble up mangled declarations. The last two are stabs in the dark; I hope they give the parser a chance to recover from errors in statements and expressions. <<grammatical rules>>= exp : error { $$.type = error_type; } \section{Putting it all together} Here are the necessary {\tt\#include} files: <<common include files>>= #include <assert.h> #include <stdio.h> #include "types.h" #include "predef.h" #include "tree.h" #include "typecheck.h" #include "codegen.h" #include "symbol.h" #include "declarations.h" #include "constants.h" #include "errors.h" There are no include files used exclusively by the parser. This is because the lexer sees {\tt y.tab.h}, and so has to know everything the parser knows. <<grammatical include files>>= This is boilerplate for every {\tt lex} specification ever written: <<lexer>>= <<common include files>> <<lexical include files>> <<lexical C declarations>> <<lexical definitions>> <<lexical rules>> <<lexical C functions>> And this is boilerplate for every {\tt yacc} specification ever written: <<parser>>= <<common include files>> <<grammatical include files>> <<grammatical C declarations>> extern int yylex(void); <<grammatical declarations>> <<grammatical rules>> #define lint /* keep lcc from barking about no reference to yaccpar_sccsid */ <<grammatical C functions>> \section{Indices} \subsection{Chunks} \nowebchunks \subsection{Identifiers} \nowebindex